Skip to content

fix(cluster): strip signing_key from /api/cluster/workers#296

Merged
jaylfc merged 2 commits into
masterfrom
fix/cluster-workers-signing-key-leak
May 1, 2026
Merged

fix(cluster): strip signing_key from /api/cluster/workers#296
jaylfc merged 2 commits into
masterfrom
fix/cluster-workers-signing-key-leak

Conversation

@jaylfc
Copy link
Copy Markdown
Owner

@jaylfc jaylfc commented May 1, 2026

Summary

  • `asdict(WorkerInfo)` includes the worker's raw HMAC `signing_key: bytes`. FastAPI's encoder utf-8-decodes bytes fields and crashes on non-utf8 key material. Result: `/api/cluster/workers` returns 500 every call, the cluster widget on the dashboard goes blank, the Fedora worker disappears from the UI even though heartbeats are flowing.
  • Strip `signing_key` from the response. The HMAC secret has no place on the wire even when serialization works.

Repro (pre-fix)

  1. Have any worker registered (Pi controller + Fedora worker)
  2. Open the dashboard or hit `GET /api/cluster/workers` directly
  3. 500 `UnicodeDecodeError` on every call

Test plan

  • Reproduced on Pi
  • Lint clean
  • After merge + Pi rebuild, Fedora worker reappears in cluster widget

Summary by CodeRabbit

Release Notes

  • New Features

    • Agents now receive personalized system instructions that adapt based on channel type and team member roles, enabling more context-aware responses.
  • Bug Fixes

    • Removed sensitive credentials from worker API responses to enhance security.
  • Tests

    • Added extensive test coverage for agent instruction generation and message routing context handling.

jaylfc added 2 commits May 1, 2026 16:31
…ispatch

Adds agent_manual.py with build_manual() — a pure function that returns a
per-agent operating manual (≤500 tokens) covering @-mention routing, project
task verbs, and lead/non-lead status. The router prepends it as role:system
before the conversation history on every chat dispatch so agents always know
the channel primitives without per-session briefs.
asdict(WorkerInfo) dumps the worker's raw HMAC signing_key bytes into
the response. FastAPI's default jsonable_encoder maps bytes via
o.decode() with implicit utf-8 — random key material has bytes outside
the utf-8 range, so the entire endpoint 500s. Visible symptom: workers
disappear from the cluster widget on the dashboard because the list
endpoint never succeeds.

Beyond the crash, leaking the HMAC signing key over the API has no
legitimate purpose. Strip it from the response.

Confirmed on Pi: GET /api/cluster/workers was returning 500 with
"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2..." every
poll. Fedora worker (192.168.6.108) heartbeats are fine — the bug was
purely in the list-side serialization.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 1, 2026

📝 Walkthrough

Walkthrough

This pull request introduces an agent-specific system-prompt injection feature via a new build_manual module that generates conditional operating manual text based on channel type and lead membership. The agent chat router now derives lead lists from channel settings and prepends generated manuals to per-agent contexts. A minor security fix strips signing keys from the worker API response.

Changes

Cohort / File(s) Summary
Agent Manual System
tinyagentos/agent_manual.py, tinyagentos/agent_chat_router.py
New build_manual function generates markdown system prompts with conditional sections (routing guidance, kanban verbs, lead/non-lead instructions) based on channel type and project context. Router integrates by deriving leads from settings and injecting the manual as a prepended system message.
Agent Manual Tests
tests/test_agent_manual.py, tests/test_agent_chat_router.py
New test suites verify manual generation for DM/group/project channels, validate lead vs. non-lead branching, enforce token-budget constraints (≤200 words for DM, ≤500 for project), and confirm router correctly injects the manual as the first system message in agent context.
Worker API Security
tinyagentos/routes/cluster.py
list_workers endpoint now strips raw signing_key from serialized worker data to prevent serialization failures and credential exposure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 A rabbit hops through channels bright,
Crafting manuals left and right,
Leads and non-leads each get their way,
System prompts guide agents every day! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Title check ⚠️ Warning The PR title 'fix(cluster): strip signing_key from /api/cluster/workers' directly describes the main change in tinyagentos/routes/cluster.py. However, the changeset includes four additional file changes (two new test files for agent_manual and agent_chat_router, and one modification to agent_chat_router.py) that are unrelated to the cluster fix and not reflected in the title. Either split this into separate PRs (cluster fix vs. agent manual/router tests), or update the title to reflect all major changes (e.g., 'feat: add agent manual system and cluster signing_key fix').
Docstring Coverage ⚠️ Warning Docstring coverage is 15.79% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cluster-workers-signing-key-leak

Review rate limit: 9/10 reviews remaining, refill in 6 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

@kilo-code-bot
Copy link
Copy Markdown

kilo-code-bot Bot commented May 1, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (5 files)
  • tests/test_agent_chat_router.py - Added tests for system context injection
  • tests/test_agent_manual.py - Comprehensive tests for agent manual generation
  • tinyagentos/agent_chat_router.py - Integrated manual injection into routing logic
  • tinyagentos/agent_manual.py - New module for building operating manuals
  • tinyagentos/routes/cluster.py - Fixed signing key exposure in API response

Reviewed by grok-code-fast-1:optimized:free · 55,589 tokens

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tinyagentos/routes/cluster.py (2)

19-33: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix mutable default values in WorkerRegister (shared state across requests).

WorkerRegister defines mutable defaults:

  • hardware: dict = {} (Line 22)
  • backends: list[dict] = [] (Line 23)
  • models: list[str] = [] (Line 24)
  • capabilities: list[str] = [] (Line 25)

Even if Pydantic often copies defaults, relying on that is risky: mutable defaults can lead to cross-request data leakage / unexpected mutation when any code mutates these values.

Suggested fix: switch to Field(default_factory=...) for each mutable field.

🛠️ Proposed fix
-from pydantic import BaseModel
+from pydantic import BaseModel, Field

 class WorkerRegister(BaseModel):
     name: str
     url: str
-    hardware: dict = {}
-    backends: list[dict] = []
-    models: list[str] = []
-    capabilities: list[str] = []
+    hardware: dict = Field(default_factory=dict)
+    backends: list[dict] = Field(default_factory=list)
+    models: list[str] = Field(default_factory=list)
+    capabilities: list[str] = Field(default_factory=list)
     platform: str = ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tinyagentos/routes/cluster.py` around lines 19 - 33, WorkerRegister uses
mutable default values for hardware, backends, models, and capabilities which
can cause shared state; update these fields to use Pydantic's Field with
default_factory (e.g., Field(default_factory=dict) for hardware and
Field(default_factory=list) for backends/models/capabilities) and import Field
from pydantic, leaving other fields and kv_* defaults unchanged so each model
instance gets its own fresh collection.

66-89: ⚠️ Potential issue | 🟠 Major

API contract mismatch: list_workers returns bare array, but consumer expects wrapped object.

The endpoint returns a bare JSON array, but tinyagentos/scheduling/resource_manager.py (line 112) calls .get("workers", []) expecting an object shape {"workers": [...]}. When the actual list response is received, .get() fails silently due to exception handling and returns [], causing cluster workers to never be discovered during remote resource checks.

Fix: In resource_manager.py line 112, change resp.json().get("workers", []) to resp.json() to match the actual endpoint response.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tinyagentos/routes/cluster.py` around lines 66 - 89, The remote workers API
(list_workers) returns a bare JSON array but the client is calling
resp.json().get("workers", []), causing discovery to fail; locate the call that
uses resp.json().get("workers", []) (the code that fetches remote cluster
workers) and replace that expression with resp.json() so the client treats the
response as the list itself (ensure the returned value is assigned to the
variable used as the workers list).
🧹 Nitpick comments (1)
tinyagentos/routes/cluster.py (1)

81-87: ⚡ Quick win

Avoid mutating worker state during a GET response (unless intentional + safe).

Inside list_workers, when registry is not None, the code both:

  • computes tier_id / potential_capabilities for the response (Lines 82-85), and
  • mutates in-memory worker state to “keep WorkerInfo fields in sync” (Lines 86-87).

If multiple requests come in concurrently (or other code relies on these fields being updated only during heartbeat/registration), this can cause subtle racey/ordering behavior and makes GET have side effects.

If you want to persist these derived fields, consider moving that update to the heartbeat/registration path; otherwise compute only into d without touching w.tier_id / w.potential_capabilities.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tinyagentos/routes/cluster.py` around lines 81 - 87, In list_workers, avoid
mutating in-memory worker state during a GET: remove the assignments to
w.tier_id and w.potential_capabilities so the handler only computes tier_id and
pot_caps via _potential_capabilities(hardware, registry) and writes them into
the response dict d; if you need to persist these derived fields instead, move
the w.tier_id/w.potential_capabilities updates into the worker
heartbeat/registration code path (where WorkerInfo is intentionally updated)
rather than inside list_workers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tinyagentos/routes/cluster.py`:
- Around line 19-33: WorkerRegister uses mutable default values for hardware,
backends, models, and capabilities which can cause shared state; update these
fields to use Pydantic's Field with default_factory (e.g.,
Field(default_factory=dict) for hardware and Field(default_factory=list) for
backends/models/capabilities) and import Field from pydantic, leaving other
fields and kv_* defaults unchanged so each model instance gets its own fresh
collection.
- Around line 66-89: The remote workers API (list_workers) returns a bare JSON
array but the client is calling resp.json().get("workers", []), causing
discovery to fail; locate the call that uses resp.json().get("workers", []) (the
code that fetches remote cluster workers) and replace that expression with
resp.json() so the client treats the response as the list itself (ensure the
returned value is assigned to the variable used as the workers list).

---

Nitpick comments:
In `@tinyagentos/routes/cluster.py`:
- Around line 81-87: In list_workers, avoid mutating in-memory worker state
during a GET: remove the assignments to w.tier_id and w.potential_capabilities
so the handler only computes tier_id and pot_caps via
_potential_capabilities(hardware, registry) and writes them into the response
dict d; if you need to persist these derived fields instead, move the
w.tier_id/w.potential_capabilities updates into the worker
heartbeat/registration code path (where WorkerInfo is intentionally updated)
rather than inside list_workers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c21cb44a-bd03-4e24-b789-2023963cceb1

📥 Commits

Reviewing files that changed from the base of the PR and between 5aaf8e3 and 9b4154e.

📒 Files selected for processing (5)
  • tests/test_agent_chat_router.py
  • tests/test_agent_manual.py
  • tinyagentos/agent_chat_router.py
  • tinyagentos/agent_manual.py
  • tinyagentos/routes/cluster.py

@jaylfc jaylfc merged commit c33d764 into master May 1, 2026
8 checks passed
@jaylfc jaylfc deleted the fix/cluster-workers-signing-key-leak branch May 1, 2026 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant